OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Compression of scan-digitized Indian language printed text: A soft pattern matching technique

Internal identifier: 001802 (Main/Exploration); previous: 001801; next: 001803

Authors: U. Garain [India]; S. Debnath [India]; A. Mandal [India]; Bidyut Baran Chaudhuri [India]

RBID : Pascal:05-0040310

Abstract

In this paper, a new compression scheme is presented for Indian Language (IL) textual document images. Since OCR (Optical Character Recognition) technology for IL scripts is not yet mature enough, transcribing these documents into the digital domain requires new techniques that achieve a high degree of compression, as well as suitable methods for performing operations such as document indexing and retrieval. The proposed method is essentially based on a symbolic compression technique, realized with an efficient segmentation-based clustering approach. A soft pattern-matching technique has been implemented using two different feature sets that cooperate with each other to build an efficient prototype library. Experiments have been done on documents printed in Devnagari (Hindi) and Bangla scripts, two of the most widely used scripts in the Indian subcontinent. Test results show that the proposed technique outperforms several standard methods, such as CCITT Group-4 and JBIG, which are frequently used for compression of document images.
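
The abstract does not specify the features or thresholds used, so the following is only a minimal Python sketch of the general idea of symbolic compression with a soft match, not the authors' method. It assumes segmented symbols arrive as small binary bitmaps, and it uses two hypothetical feature sets (a pixel-wise difference and row/column projection profiles) that must both agree before a symbol is mapped to an existing prototype. All function names and the tolerance values are illustrative assumptions.

```python
Bitmap = tuple[tuple[int, ...], ...]  # equal-length rows of 0/1 pixels


def pixel_distance(a: Bitmap, b: Bitmap) -> float:
    """Hypothetical feature 1: fraction of differing pixels (1.0 if shapes differ)."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return 1.0
    total = len(a) * len(a[0])
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def profile(bitmap: Bitmap) -> list[int]:
    """Ink counts per row followed by ink counts per column."""
    rows = [sum(r) for r in bitmap]
    cols = [sum(c) for c in zip(*bitmap)]
    return rows + cols


def profile_distance(a: Bitmap, b: Bitmap) -> float:
    """Hypothetical feature 2: normalized difference of projection profiles."""
    pa, pb = profile(a), profile(b)
    if len(pa) != len(pb):
        return 1.0
    scale = max(sum(pa), sum(pb), 1)  # avoid division by zero on blank bitmaps
    return sum(abs(x - y) for x, y in zip(pa, pb)) / scale


def soft_match(a: Bitmap, b: Bitmap, pixel_tol: float = 0.15,
               profile_tol: float = 0.15) -> bool:
    """Accept a match only when both feature sets agree (assumed tolerances)."""
    return (pixel_distance(a, b) <= pixel_tol
            and profile_distance(a, b) <= profile_tol)


def compress(symbols: list[Bitmap]) -> tuple[list[Bitmap], list[int]]:
    """Build a prototype library and encode each symbol as a library index."""
    library: list[Bitmap] = []
    indices: list[int] = []
    for sym in symbols:
        for i, proto in enumerate(library):
            if soft_match(sym, proto):
                indices.append(i)
                break
        else:
            # No prototype is close enough: the symbol becomes a new prototype.
            library.append(sym)
            indices.append(len(library) - 1)
    return library, indices


def decompress(library: list[Bitmap], indices: list[int]) -> list[Bitmap]:
    """Lossy reconstruction: every symbol is replaced by its prototype."""
    return [library[i] for i in indices]
```

In this sketch the saving comes from storing each distinct shape once and replacing repeated occurrences with small integers. A real encoder would also record symbol positions and entropy-code the indices, which is omitted here.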



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Compression of scan-digitized Indian language printed text: A soft pattern matching technique</title>
<author>
<name sortKey="Garain, U" sort="Garain, U" uniqKey="Garain U" first="U." last="Garain">U. Garain</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Indian Statistical Institute, 203, B. T. Road</s1>
<s2>Kolkata 700 108</s2>
<s3>IND</s3>
<sZ>1 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Kolkata 700 108</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Debnath, S" sort="Debnath, S" uniqKey="Debnath S" first="S." last="Debnath">S. Debnath</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>Regional Engineering College </s1>
<s2>Durgapur, West Bengal</s2>
<s3>IND</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Regional Engineering College </wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Mandal, A" sort="Mandal, A" uniqKey="Mandal A" first="A." last="Mandal">A. Mandal</name>
<affiliation wicri:level="1">
<inist:fA14 i1="03">
<s1>Defense Research &amp; Development Organization</s1>
<s2>Pune</s2>
<s3>IND</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Defense Research &amp; Development Organization</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Chaudhuri, B B" sort="Chaudhuri, B B" uniqKey="Chaudhuri B" first="B. B." last="Chaudhuri">Bidyut Baran Chaudhuri</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Indian Statistical Institute, 203, B. T. Road</s1>
<s2>Kolkata 700 108</s2>
<s3>IND</s3>
<sZ>1 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Kolkata 700 108</wicri:noRegion>
<placeName>
<settlement type="city">Calcutta</settlement>
<region type="province">Bengale-Occidental</region>
</placeName>
<orgName type="lab" n="5">Institut indien de statistiques</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">05-0040310</idno>
<date when="2003">2003</date>
<idno type="stanalyst">PASCAL 05-0040310 INIST</idno>
<idno type="RBID">Pascal:05-0040310</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000492</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000297</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000567</idno>
<idno type="wicri:Area/Main/Merge">001881</idno>
<idno type="wicri:Area/Main/Curation">001802</idno>
<idno type="wicri:Area/Main/Exploration">001802</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Compression of scan-digitized Indian language printed text: A soft pattern matching technique</title>
<author>
<name sortKey="Garain, U" sort="Garain, U" uniqKey="Garain U" first="U." last="Garain">U. Garain</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Indian Statistical Institute, 203, B. T. Road</s1>
<s2>Kolkata 700 108</s2>
<s3>IND</s3>
<sZ>1 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Kolkata 700 108</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Debnath, S" sort="Debnath, S" uniqKey="Debnath S" first="S." last="Debnath">S. Debnath</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>Regional Engineering College </s1>
<s2>Durgapur, West Bengal</s2>
<s3>IND</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Regional Engineering College </wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Mandal, A" sort="Mandal, A" uniqKey="Mandal A" first="A." last="Mandal">A. Mandal</name>
<affiliation wicri:level="1">
<inist:fA14 i1="03">
<s1>Defense Research &amp; Development Organization</s1>
<s2>Pune</s2>
<s3>IND</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Defense Research &amp; Development Organization</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Chaudhuri, B B" sort="Chaudhuri, B B" uniqKey="Chaudhuri B" first="B. B." last="Chaudhuri">Bidyut Baran Chaudhuri</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Indian Statistical Institute, 203, B. T. Road</s1>
<s2>Kolkata 700 108</s2>
<s3>IND</s3>
<sZ>1 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Kolkata 700 108</wicri:noRegion>
<placeName>
<settlement type="city">Calcutta</settlement>
<region type="province">Bengale-Occidental</region>
</placeName>
<orgName type="lab" n="5">Institut indien de statistiques</orgName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithm</term>
<term>Data compression</term>
<term>Image document</term>
<term>Indian</term>
<term>Language</term>
<term>Printed document</term>
<term>Record format</term>
<term>Segmentation</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Compression donnée</term>
<term>Format enregistrement</term>
<term>Document imprimé</term>
<term>Langage</term>
<term>Indien</term>
<term>Segmentation</term>
<term>Algorithme</term>
<term>Document image</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Langage</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">In this paper, a new compression scheme is presented for Indian Language (IL) textual document images. Since OCR (Optical Character Recognition) technology for IL scripts is not yet mature enough, transcribing these documents into the digital domain requires new techniques that achieve a high degree of compression, as well as suitable methods for performing operations such as document indexing and retrieval. The proposed method is essentially based on a symbolic compression technique, realized with an efficient segmentation-based clustering approach. A soft pattern-matching technique has been implemented using two different feature sets that cooperate with each other to build an efficient prototype library. Experiments have been done on documents printed in Devnagari (Hindi) and Bangla scripts, two of the most widely used scripts in the Indian subcontinent. Test results show that the proposed technique outperforms several standard methods, such as CCITT Group-4 and JBIG, which are frequently used for compression of document images.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Inde</li>
</country>
<region>
<li>Bengale-Occidental</li>
</region>
<settlement>
<li>Calcutta</li>
</settlement>
<orgName>
<li>Institut indien de statistiques</li>
</orgName>
</list>
<tree>
<country name="Inde">
<noRegion>
<name sortKey="Garain, U" sort="Garain, U" uniqKey="Garain U" first="U." last="Garain">U. Garain</name>
</noRegion>
<name sortKey="Chaudhuri, B B" sort="Chaudhuri, B B" uniqKey="Chaudhuri B" first="B. B." last="Chaudhuri">Bidyut Baran Chaudhuri</name>
<name sortKey="Debnath, S" sort="Debnath, S" uniqKey="Debnath S" first="S." last="Debnath">S. Debnath</name>
<name sortKey="Mandal, A" sort="Mandal, A" uniqKey="Mandal A" first="A." last="Mandal">A. Mandal</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001802 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001802 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:05-0040310
   |texte=   Compression of scan-digitized Indian language printed text: A soft pattern matching technique
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024